Skip to content

revert pinned hostmem for pod comm#292

Open
AtlantaPepsi wants to merge 1 commit into
ROCm:candidatefrom
AtlantaPepsi:EGM
Open

revert pinned hostmem for pod comm#292
AtlantaPepsi wants to merge 1 commit into
ROCm:candidatefrom
AtlantaPepsi:EGM

Conversation

@AtlantaPepsi
Copy link
Copy Markdown
Contributor

Motivation

This is to add back the option of extended GPU memory in cross pod transfer.

Technical Details

Removed CheckPages() inside AllocateMemory() for pinned host memory. This would cause a silent fail on the node where it's allocated, and other nodes will hang at broadcast inside ExchangeMemory.

It's also not tested yet on either Nvidia or AMD platforms.

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings May 8, 2026 20:50
@AtlantaPepsi AtlantaPepsi requested a review from a team as a code owner May 8, 2026 20:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts pod-communication (fabric-handle) memory allocation to better support exporting extended (HOST_NUMA) memory across pods, and avoids a failure mode where NUMA page validation caused one rank to error while other ranks hung in collectives.

Changes:

  • Add a GetMemLocation() helper to correctly select DEVICE vs HOST_NUMA locations for pod-comm VMM allocations and access descriptors.
  • Remove CheckPages()/move_pages() validation for fabric-exportable host allocations (keep zeroing), with an explanatory comment.
  • Extend CUDA-compat macro aliases/undefs to include hipMemLocation and hipMemLocationTypeHostNuma.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1426 to +1431
// Determine location id
if (memDevice.memType == MEM_CPU_CLOSEST) {
location.id = GetClosestCpuNumaToGpu(memDevice.memIndex);
} else {
location.id = memDevice.memIndex;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants